Abstract


There have been many recent advances in the structure and measurement of distributed language models: those that map from words to a vector-space that is rich in information about word choice and composition. This vector-space is the distributed language representation.

The goal of this note is to point out that any distributed representation can be turned into a classifier through inversion via Bayes rule. The approach is simple and modular, in that it will work with any language representation whose training can be formulated as optimizing a probability model. In our application to 2 million sentences from Yelp reviews, we also find that it performs as well as or better than complex purpose-built algorithms.


1 Introduction 


Distributed, or vector-space, language representations $\mathcal{V}$ consist of a location, or embedding, for every vocabulary word in $\mathbb{R}^K$, where $K$ is the dimension of the latent representation space. These locations are learned to optimize, perhaps approximately, an objective function defined on the original text such as a likelihood for word occurrences.

A popular example is the Word2Vec machinery of Mikolov et al. (2013b). This trains the distributed representation to be useful as an input layer for prediction of words from their neighbors in a Skip-gram likelihood. That is, to maximize

$$\sum_{j \neq t,\; j = t-b}^{t+b} \log p_{\mathcal{V}}(w_{sj} \mid w_{st}) \tag{1}$$

summed across all words $w_{st}$ in all sentences $\mathbf{w}_s$, where $b$ is the skip-gram window (truncated by the ends of the sentence) and $p_{\mathcal{V}}(w_{sj} \mid w_{st})$ is a neural network classifier that takes vector representations for $w_{st}$ and $w_{sj}$ as input (see Section 2).

Distributed language representations have been studied since the early work on neural networks (Rumelhart et al., 1986) and have long been applied in natural language processing (Morin and Bengio, 2005). The models are generating much recent interest due to the large performance gains from the newer systems, including Word2Vec and the GloVe model of Pennington et al. (2014), observed in, e.g., word prediction, word analogy identification, and named entity recognition.

Given the success of these new models, researchers have begun searching for ways to adapt the representations for use in document classification tasks such as sentiment prediction or author identification. One naive approach is to use aggregated word vectors across a document (e.g., a document's average word-vector location) as input to a standard classifier (e.g., logistic regression). However, a document is actually an ordered path of locations through $\mathbb{R}^K$, and simple averaging destroys much of the available information.

More sophisticated aggregation is proposed in Socher et al. (2011; 2013), where recursive neural networks are used to combine the word vectors through the estimated parse tree for each sentence. Alternatively, Le and Mikolov's Doc2Vec (2014) adds document labels to the conditioning set in (1) and has them influence the skip-gram likelihood through a latent input vector location in $\mathcal{V}$. In each case, the end product is a distributed representation for every sentence (or document for Doc2Vec) that can be used as input to a generic classifier.


1.1 Bayesian Inversion 

These approaches all add considerable model and estimation complexity to the original underlying distributed representation. We are proposing a simple alternative that turns fitted distributed language representations into document classifiers without any additional modeling or estimation.

A typical language model is trained to maximize the likelihoods of single words and their neighbors. For example, the skip-gram in (1) represents conditional probability for a word's context (surrounding words), while the alternative CBOW Word2Vec specification (Mikolov et al., 2013a) targets the conditional probability for each word given its context. Although these objectives do not correspond to a full document likelihood model, they can be interpreted as components in a composite likelihood approximation.¹

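Use $\mathbf{w} = [w_1 \ldots w_T]'$ to denote a sentence: an ordered vector of words. The skip-gram in (1) yields the pairwise composite log likelihood²

$$\log p_{\mathcal{V}}(\mathbf{w}) = \sum_{j=1}^{T} \sum_{k=1}^{T} \mathbb{1}_{[1 \le |k-j| \le b]} \log p_{\mathcal{V}}(w_k \mid w_j). \tag{2}$$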

In another example, Jernite et al. (2015) show that CBOW Word2Vec corresponds to the pseudolikelihood for a Markov random field sentence model.

Finally, given a sentence likelihood as in (2), document $d = \{\mathbf{w}_1, \ldots, \mathbf{w}_S\}$ has log likelihood

$$\log p_{\mathcal{V}}(d) = \sum_{s} \log p_{\mathcal{V}}(\mathbf{w}_s). \tag{3}$$
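The composite likelihoods in (2) and (3) can be sketched in a few lines of Python. This is an illustration only: `log_pair_prob` is a hypothetical stand-in for the fitted $\log p_{\mathcal{V}}(w_k \mid w_j)$, and the uniform toy model below is an assumption made so the example runs without a trained representation.

```python
import math

def sentence_loglik(sentence, log_pair_prob, window):
    """Pairwise composite log likelihood of one sentence, following (2)."""
    total = 0.0
    n = len(sentence)
    for j, center in enumerate(sentence):
        # Neighbors k with 1 to `window` positions of separation from j,
        # truncated by the ends of the sentence.
        for k in range(max(0, j - window), min(n, j + window + 1)):
            if k != j:
                total += log_pair_prob(sentence[k], center)
    return total

def document_loglik(sentences, log_pair_prob, window):
    """Document log likelihood as the sum over its sentences, following (3)."""
    return sum(sentence_loglik(s, log_pair_prob, window) for s in sentences)

def uniform_log_prob(context, center):
    """Toy stand-in (assumption): uniform over a 4-word vocabulary."""
    return math.log(0.25)

doc = [["the", "food", "was", "great"], ["will", "return"]]
doc_ll = document_loglik(doc, uniform_log_prob, window=1)
```

With a window of 1 the first sentence contributes 6 ordered neighbor pairs and the second contributes 2, so `doc_ll` is 8 times the per-pair log probability.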


Now suppose that your training documents are grouped by class label, $y \in \{1 \ldots C\}$. We can train separate distributed language representations for each set of documents as partitioned by $y$; for example, fit Word2Vec independently on each sub-corpus $D_c = \{d_i : y_i = c\}$ and obtain the labeled distributed representation map $\mathcal{V}_c$. A new document $d$ has probability $p_{\mathcal{V}_c}(d)$ if we treat it as a member of class $c$, and Bayes rule implies

$$p(y \mid d) = \frac{p_{\mathcal{V}_y}(d)\, \pi_y}{\sum_c p_{\mathcal{V}_c}(d)\, \pi_c} \tag{4}$$

where $\pi_c$ is our prior probability on class label $c$.
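As a minimal sketch of the Bayes rule in (4), the function below turns per-class document log likelihoods into class probabilities. The function name, the example log likelihood values, and the use of a log-sum-exp shift for numerical stability are illustrative assumptions, not part of the original implementation.

```python
import math

def class_posteriors(doc_logliks, priors=None):
    """Class probabilities p(y | d) from per-class log likelihoods, as in (4).

    doc_logliks[c] is log p_{V_c}(d); priors[c] is pi_c (uniform if omitted).
    """
    n_class = len(doc_logliks)
    if priors is None:
        priors = [1.0 / n_class] * n_class
    # Work with log(p * pi) and subtract the maximum before exponentiating,
    # since document likelihoods are typically far too small to exponentiate.
    scores = [ll + math.log(p) for ll, p in zip(doc_logliks, priors)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical log likelihoods of one document under three class models.
posterior = class_posteriors([-1050.0, -1047.5, -1049.0])
```

The shift by the maximum score cancels in the ratio, so the result equals (4) exactly while avoiding underflow.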

Thus distributed language representations trained separately for each class label yield directly a document classification rule via (4). This approach has a number of attractive qualities.

Simplicity: The inversion strategy works for any model of language that can be (or whose training can be) interpreted as a probabilistic model. This makes for easy implementation in systems that are already engineered to fit such language representations, leading to faster deployment and lower development costs. The strategy is also interpretable: whatever intuition one has about the distributed language model can be applied directly to the inversion-based classification rule. Inversion adds a plausible model for reader understanding on top of any given language representation.

¹ Composite likelihoods are a common tool in analysis of spatial data and data on graphs. They were popularized in statistics by Besag's (1974; 1975) work on the pseudolikelihood - $p(\mathbf{w}) \approx \prod_j p(w_j \mid \mathbf{w}_{-j})$ - for analysis of Markov random fields. See Varin et al. (2011) for a detailed review.

² See Molenberghs and Verbeke (2006) for similar pairwise compositions in analysis of longitudinal data.


Scalability: When working with massive corpora it is often useful to split the data into blocks as part of distributed computing strategies. Our model of classification via inversion provides a convenient top-level partitioning of the data. An efficient system could fit separate by-class language representations, which will provide for document classification as in this article as well as class-specific answers for NLP tasks such as word prediction or analogy. When one wishes to treat a document as unlabeled, NLP tasks can be answered through ensemble aggregation of the class-specific answers.


Performance: We find that, in our examples, inversion of Word2Vec yields lower misclassification rates than both Doc2Vec-based classification and the multinomial inverse regression (MNIR) of Taddy (2013b). We did not anticipate such outright performance gain. Moreover, we expect that with calibration (i.e., through cross-validation) of the many tuning parameters available when fitting both Word2Vec and Doc2Vec the performance results will change. Indeed, we find that all methods are often outperformed by phrase-count logistic regression with rare-feature up-weighting and carefully chosen regularization. However, the out-of-the-box performance of Word2Vec inversion argues for its consideration as a simple default in document classification.


In the remainder, we outline classification through inversion of a specific Word2Vec model and illustrate the ideas in classification of Yelp reviews. The implementation requires only a small extension of the popular gensim python library (Rehurek and Sojka, 2010); the extended library as well as code to reproduce all of the results in this paper are available on GitHub. In addition, the Yelp data is publicly available as part of the corresponding data mining contest at kaggle.com. See github.com/taddylab/deepir for detail.


















2 Implementation 


Word2Vec trains $\mathcal{V}$ to maximize the skip-gram likelihood based on (1). We work with the Huffman softmax specification (Mikolov et al., 2013b), which includes a pre-processing step to encode each vocabulary word in its representation via a binary Huffman tree (see Figure 1).

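Each individual probability is then

$$p_{\mathcal{V}}(w \mid w_t) = \prod_{i=1}^{L(w)-1} \sigma\left( \mathrm{ch}[\eta(w, i+1)] \, \mathbf{u}_{\eta(w,i)}^{\top} \mathbf{v}_{w_t} \right) \tag{5}$$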

where $\eta(w, i)$ is the $i$th node in the Huffman tree path, of length $L(w)$, for word $w$; $\sigma(x) = 1/(1 + \exp[-x])$; and $\mathrm{ch}(\eta) \in \{-1, +1\}$ translates from whether $\eta$ is a left or right child to +/-1. Every word thus has both input and output vector coordinates, $\mathbf{v}_w$ and $[\mathbf{u}_{\eta(w,1)} \cdots \mathbf{u}_{\eta(w,L(w))}]$. Typically, only the input space $\mathbf{V} = [\mathbf{v}_{w_1} \cdots \mathbf{v}_{w_p}]$, for a $p$-word vocabulary, is reported as the language representation - these vectors are used as input for NLP tasks. However, the full representation $\mathcal{V}$ includes mapping from each word to both $\mathbf{V}$ and $\mathbf{U}$.
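A small sketch may help make the Huffman softmax probability concrete. It multiplies one sigmoid per internal node on a word's tree path, using the output vector of that node and the input vector of the conditioning word. The tree, the vectors, and the convention that a left child maps to -1 and a right child to +1 are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_word_prob(path_nodes, child_signs, U, v_input):
    """Probability of a word given an input word under a Huffman softmax.

    path_nodes:  internal nodes on the path from the root towards the word
    child_signs: +1 or -1 for the child taken at each of those nodes
    U:           dict mapping node id to its output vector
    v_input:     input vector of the conditioning word
    """
    prob = 1.0
    for node, sign in zip(path_nodes, child_signs):
        dot = sum(u * v for u, v in zip(U[node], v_input))
        prob *= sigmoid(sign * dot)
    return prob

# Hypothetical 3-word tree: node 0 is the root, node 1 is its right child.
# Word A is the left child of the root; B and C are the children of node 1.
U = {0: [0.5, -0.2], 1: [-0.3, 0.8]}
v_t = [1.0, 2.0]
paths = {"A": ([0], [-1]), "B": ([0, 1], [+1, -1]), "C": ([0, 1], [+1, +1])}
probs = {w: hs_word_prob(nodes, signs, U, v_t) for w, (nodes, signs) in paths.items()}
```

Because $\sigma(x) + \sigma(-x) = 1$ at every internal node, the probabilities over the whole vocabulary sum to one without any explicit normalization, which is what makes this specification cheap to evaluate.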

We apply the gensim python implementation of Word2Vec, which fits the model via stochastic gradient descent (SGD), under default specification. This includes a vector space of dimension $K = 100$ and a skip-gram window of size $b = 5$.

Figure 1: Binary Huffman encoding of a 4 word vocabulary, based upon 18 total utterances. At each step proceeding from left to right the two nodes with lowest count are combined into a parent node. Binary encodings are read back off of the splits moving from right to left.

2.1 Word2Vec Inversion

Given Word2Vec trained on each of $C$ class-specific corpora $D_1 \ldots D_C$, leading to $C$ distinct language representations $\mathcal{V}_1 \ldots \mathcal{V}_C$, classification for new documents is straightforward. Consider the $S$-sentence document $d$: each sentence $\mathbf{w}_s$ is given a probability under each representation $\mathcal{V}_c$ by applying the calculations in (1) and (2). This leads to the $S \times C$ matrix of sentence probabilities, $p_{\mathcal{V}_c}(\mathbf{w}_s)$, and document probabilities are obtained

$$p_{\mathcal{V}_c}(d) = \frac{1}{S} \sum_{s} p_{\mathcal{V}_c}(\mathbf{w}_s). \tag{6}$$

Finally, class probabilities are calculated via Bayes rule as in (4). We use priors $\pi_c = 1/C$, so that classification proceeds by assigning the class

$$\hat{y} = \operatorname{argmax}_c \; p_{\mathcal{V}_c}(d). \tag{7}$$
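The classification step can be sketched as follows. The sketch assumes that the document probability under each class is the average of that class's sentence probabilities (our reading of (6)) and that priors are uniform; the function name and the toy probability matrix are hypothetical.

```python
def classify_document(sentence_probs):
    """Assign a class to one document from its S x C sentence probabilities.

    sentence_probs[s][c] is the probability of sentence s under class model c.
    Returns the selected class index and the class probabilities.
    """
    n_sent = len(sentence_probs)
    n_class = len(sentence_probs[0])
    # Assumed form of (6): average the sentence probabilities within each class.
    doc_probs = [sum(row[c] for row in sentence_probs) / n_sent
                 for c in range(n_class)]
    # With uniform priors the prior terms cancel in Bayes rule (4).
    total = sum(doc_probs)
    posteriors = [p / total for p in doc_probs]
    label = max(range(n_class), key=lambda c: doc_probs[c])
    return label, posteriors

# Toy example: a 2-sentence document scored under 2 class models.
label, posteriors = classify_document([[0.1, 0.3], [0.2, 0.2]])
```

In practice raw sentence likelihoods are extremely small, so an implementation would likely compute them in log space and rescale before averaging; the toy values here sidestep that concern.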


3 Illustration 


We consider a corpus of reviews provided by Yelp for a contest on kaggle.com. The text is tokenized simply by converting to lowercase before splitting on punctuation and white-space. The training data are 230,000 reviews containing more than 2 million sentences. Each review is marked by a number of stars, from 1 to 5, and we fit separate Word2Vec representations $\mathcal{V}_1 \ldots \mathcal{V}_5$ for the documents at each star rating. The validation data consist of 23,000 reviews, and we apply the inversion technique of Section 2.1 to score each validation document $d$ with class probabilities $\mathbf{q} = [q_1 \cdots q_5]$, where $q_c = p(c \mid d)$.

The probabilities will be used in three different classification tasks; for reviews as

a. negative at 1-2 stars, or positive at 3-5 stars;
b. negative 1-2, neutral 3, or positive 4-5 stars;
c. corresponding to each of 1 to 5 stars.


In each case, classification proceeds by summing across the relevant sub-class probabilities. For example, in task a, $p(\text{positive}) = q_3 + q_4 + q_5$. Note that the same five fitted Word2Vec representations are used for each task.
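The aggregation of star probabilities into the three tasks can be illustrated with a short sketch. The function names and the example probability vector are hypothetical; only the groupings of stars follow the task definitions.

```python
def task_probabilities(q):
    """Map star probabilities q = [q1, ..., q5] onto tasks a, b and c."""
    task_a = {"negative": q[0] + q[1], "positive": q[2] + q[3] + q[4]}
    task_b = {"negative": q[0] + q[1], "neutral": q[2],
              "positive": q[3] + q[4]}
    task_c = {stars: q[stars - 1] for stars in range(1, 6)}
    return task_a, task_b, task_c

def predict(task):
    """Pick the label with the highest summed probability."""
    return max(task, key=task.get)

# Hypothetical class probabilities for one validation review.
q = [0.05, 0.10, 0.30, 0.35, 0.20]
task_a, task_b, task_c = task_probabilities(q)
predictions = (predict(task_a), predict(task_b), predict(task_c))
```

Since every task reuses the same five probabilities, no model is refit when the label grouping changes.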

We consider a set of related comparator techniques. In each case, some document representation (e.g., phrase counts or Doc2Vec vectors) is used as input to logistic regression prediction of the associated review rating. The logistic regressions are fit under $L_1$ regularization with the penalties weighted by feature standard deviation (which, e.g., up-weights rare phrases) and selected according to the corrected AICc criteria (Flynn et al., 2013) via the gamlr R package of Taddy (2014). For multi-class tasks b-c, we use distributed multinomial regression (DMR; Taddy 2015) via the distrom R package. DMR fits multinomial logistic regression in a factorized representation wherein one estimates independent Poisson linear models for each response category. Document representations and logistic regressions are always trained using only the training corpus.

Figure 2: Out-of-sample fitted probabilities of a review being positive (having greater than 2 stars) as a function of the true number of review stars. Box widths are proportional to number of observations in each class; roughly 10% of reviews have each of 1-3 stars, while 30% have 4 stars and 40% have 5 stars.

Doc2Vec is also fit via gensim, using the same latent space specification as for Word2Vec: $K = 100$ and $b = 5$. As recommended in the documentation, we apply repeated SGD over 20 re-orderings of each corpus (for comparability, this was also done when fitting Word2Vec). Le and Mikolov provide two alternative Doc2Vec specifications: distributed memory (DM) and distributed bag-of-words (DBOW). We fit both. Vector representations for validation documents are trained without updating the word-vector elements, leading to 100 dimensional vectors for each document for each of DM and DBOW. We input each, as well as the combined 200 dimensional DM+DBOW representation, to logistic regression.


Phrase regression applies logistic regression of response classes directly onto counts for short 1-2 word 'phrases'. The phrases are obtained using gensim's phrase builder, which simply combines highly probable pairings; e.g., first_date and chicken_wing are two pairings in this corpus.


MNIR, the multinomial inverse regression of Taddy (2013a; 2013b; 2015), is applied as implemented in the textir package for R. MNIR maps from text to the class-space of interest through a multinomial logistic regression of phrase counts onto variables relevant to the class-space. We apply MNIR to the same set of 1-2 word phrases used in phrase regression. Here, we regress phrase counts onto stars expressed numerically and as a 5-dimensional indicator vector, leading to a 6-feature multinomial logistic regression. The MNIR procedure then uses the $6 \times p$ matrix of feature-phrase regression coefficients to map from phrase-count to feature space, resulting in 6 dimensional 'sufficient reduction' statistics for each document. These are input to logistic regression.

                     a (NP)    b (NNP)   c (1-5)
W2V inversion        .099      .189      .435
Phrase regression    .084      .200      .410
D2V DBOW             .144      .282      .496
D2V DM               .179      .306      .549
D2V combined         .148      .284      .500
MNIR                 .095      .254      .480
W2V aggregation      .118      .248      .461

Table 1: Out-of-sample misclassification rates.

Word2Vec aggregation averages fitted word representations for a single Word2Vec trained on all sentences to obtain a fixed-length feature vector for each review ($K = 100$, as for inversion). This vector is then input to logistic regression.


3.1 Results 


Misclassification rates for each task on the validation set are reported in Table 1. Simple phrase-count regression is consistently the strongest performer, bested only by Word2Vec inversion on task b. This is partially due to the relative strengths of discriminative (e.g., logistic regression) vs generative (e.g., all others here) classifiers: given a large amount of training text, asymptotic efficiency of logistic regression will start to work in its favor over the finite sample advantages of a generative classifier (Ng and Jordan, 2002; Taddy, 2013c). However, the comparison is also unfair to Word2Vec and Doc2Vec: both phrase regression and MNIR are optimized exactly under AICc selected penalty, while Word2Vec and Doc2Vec have only been approximately optimized under a single specification. The distributed representations should improve with some careful engineering.

Word2Vec inversion outperforms the other document representation-based alternatives (except, by a narrow margin, MNIR in task a). Doc2Vec under DBOW specification and MNIR both do worse, but not by a large margin. In contrast to Le and Mikolov, we find here that the Doc2Vec DM model does much worse than DBOW. Regression onto simple within-document aggregations of Word2Vec performs slightly better than any Doc2Vec option (but not as well as the Word2Vec inversion). This again contrasts with the results of Le and Mikolov, and we suspect that the more complex Doc2Vec model would benefit from a careful tuning of the SGD optimization routine.³

Looking at the fitted probabilities in detail we see that Word2Vec inversion provides a more useful document ranking than any comparator (including phrase regression). For example, Figure 2 shows the probabilities of a review being 'positive' in task a as a function of the true star rating for each validation review. Although phrase regression does slightly better in terms of misclassification rate, it does so at the cost of classifying many terrible (1 star) reviews as positive. This occurs because 1-2 star reviews are more rare than 3-5 star reviews and because words of emphasis (e.g., very, completely, and !!!) are used both in very bad and in very good reviews. Word2Vec inversion is the only method that yields positive-document probabilities that are clearly increasing in distribution with the true star rating. It is not difficult to envision a misclassification cost structure that favors such nicely ordered probabilities.

4 Discussion 

The goal of this note is to point out inversion as an option for turning distributed language representations into classification rules. We are not arguing for the supremacy of Word2Vec inversion in particular, and the approach should work well with alternative representations (e.g., GloVe). Moreover, we are not even arguing that it will always outperform purpose-built classification tools. However, it is a simple, scalable, interpretable, and effective option for classification whenever you are working with such distributed representations.

³ Note also that the unsupervised document representations - Doc2Vec or the single Word2Vec used in Word2Vec aggregation - could be trained on larger unlabeled corpora. A similar option is available for Word2Vec inversion: one could take a single Word2Vec model trained on a large unlabeled corpus as a shared baseline (prior) and update separate models with additional training on each labeled sub-corpus. The representations will all be shrunk towards a baseline language model, but will differ according to distinctions between the language in each labeled sub-corpus.